Exploding Gradients
Especially in deep networks and recurrent networks, gradients can become extremely large. This can either update the model abruptly or, with Adam-like optimizers, inflate the second-moment estimate and slow down learning for many steps afterwards.
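As a rough illustration of the second-moment issue, here is a minimal NumPy sketch (the gradient values and hyperparameters such as beta2 = 0.999 are assumptions, not taken from any particular model). It feeds a single gradient spike into Adam's update v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2 and shows how long the denominator sqrt(v_hat) keeps the step size suppressed.

```python
# Sketch: how one gradient spike inflates Adam's second-moment estimate
# and shrinks the per-coordinate step size lr / (sqrt(v_hat) + eps)
# long after the spike has passed. All numbers here are illustrative.
import numpy as np

beta2, eps, lr = 0.999, 1e-8, 1e-3
steps = 5000
grads = np.full(steps, 0.01)   # "normal" gradient magnitude
grads[1000] = 100.0            # a single exploding-gradient spike

v = 0.0
step_sizes = []
for t, g in enumerate(grads, start=1):
    v = beta2 * v + (1 - beta2) * g**2
    v_hat = v / (1 - beta2**t)                 # bias correction
    step_sizes.append(lr / (np.sqrt(v_hat) + eps))

print(f"step size before spike:      {step_sizes[999]:.2e}")
print(f"step size right after spike: {step_sizes[1001]:.2e}")
print(f"step size 2000 steps later:  {step_sizes[3000]:.2e}")
```

Because beta2 is close to 1, the spike's contribution to v decays very slowly, so the effective step size stays far below its pre-spike value for on the order of 1/(1 - beta2) steps.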
In transformers, attention spikes can also trigger this kind of instability.
It's commonly mitigated by:
- Gradient clipping: rescaling the gradient whenever its global norm exceeds a threshold (see the sketch below)
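A minimal sketch of gradient clipping inside a training step, assuming PyTorch; the model, batch, and max_norm=1.0 are placeholder assumptions, not values from the text.

```python
# Sketch: clip the global gradient norm before the optimizer step (PyTorch).
import torch
import torch.nn as nn

model = nn.Linear(16, 1)                               # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x, y = torch.randn(32, 16), torch.randn(32, 1)         # placeholder batch

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()

# Rescale all gradients so their combined L2 norm is at most max_norm.
# The returned value is the pre-clipping norm, useful for logging spikes.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

optimizer.step()
```

Clipping by global norm preserves the gradient direction and only shrinks its magnitude; logging the returned pre-clipping norm is a cheap way to spot when spikes occur.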